
9.3. Recurrent Neural Network

RNNs have many variants with distinct structures. Shown in Fig. 9.10 is one of the most common architectures. On the left is the schematic of a unit. The middle subplot shows the flow of data through the unit in more detail. The subplot on the right illustrates how the unit unfolds over time.
Figure 9.10: Structure and workflow of RNN
The symbols in Fig. 9.10 have the following meanings:
$x^{t}$ is the input at step $t$ in the time series;
$h^{t}$ is the state of the hidden layer at $t$, which is determined by both $x^{t}$ and $h^{t-1}$ according to the above network architecture;
$o^{t}$ is the output at $t$, which is determined by the hidden state at the current step $t$;
$L^{t}$ is the loss function at $t$;
$y^{t}$ is the actual output (or label) at $t$;
$\tilde{y}^{t}$ is the predicted output at $t$;
$\sigma_{1}$ and $\sigma_{2}$ are the activation functions of the hidden layer and the output layer, respectively;
$U$, $W$, and $V$ are the arrays that contain the network weights to be learned.
These weights, i.e., $U$, $W$, and $V$, are shared through time. That is, though there appear to be "multiple units" when the RNN is unfolded over time, there is essentially only one unit. Thus, the backpropagation of an RNN updates the weights of this same unit.
The above symbols were introduced without discussing the shape of the input $x$, which is handled as one sample here. In many cases, $x$ has multiple attributes and thus needs to be formulated as a 1D array. In the following derivation of backpropagation, we will represent $x$ as a column array with $J$ elements (one per attribute), so that the derived equations can be used directly for coding. The actual output $\vec{y}$ and the predicted output $\tilde{\vec{y}}$ are column arrays with a shape of $K \times 1$. In implementation, we also need to determine the shape of $\vec{h}$; let us assume $\vec{h}$ has a shape of $N \times 1$. The shapes of the weight arrays can then be determined from those of $\vec{x}$, $\vec{y}$, and $\vec{h}$. The data flow, especially the change of the shapes of the arrays for data and weights, can be tracked in the following derivation.
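To make these shapes concrete, here is a minimal NumPy sketch that allocates the data and weight arrays; the sizes $J$, $N$, and $K$ and all variable names are illustrative choices, not values from the text.

```python
import numpy as np

# Illustrative sizes: J input attributes, N hidden units, K outputs
J, N, K = 3, 5, 2
rng = np.random.default_rng(0)

# Weight arrays shared across all time steps
U = rng.normal(scale=0.1, size=(N, J))   # input  -> hidden, shape (N, J)
W = rng.normal(scale=0.1, size=(N, N))   # hidden -> hidden, shape (N, N)
V = rng.normal(scale=0.1, size=(K, N))   # hidden -> output, shape (K, N)
b = np.zeros((N, 1))                     # hidden bias, shape (N, 1)
c = np.zeros((K, 1))                     # output bias, shape (K, 1)

# One sample at one time step, stored as column arrays
x_t = rng.normal(size=(J, 1))            # input x^t, shape (J, 1)
y_t = rng.normal(size=(K, 1))            # label y^t, shape (K, 1)
h_prev = np.zeros((N, 1))                # hidden state h^{t-1}, shape (N, 1)
```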

9.3.1. Forward Pass

Let us first work on the forward pass. The following three operations turn the input $\vec{x}_{J \times 1}$ into the prediction $\tilde{\vec{y}}$.
$$
\begin{gather*}
\vec{h}_{N \times 1}^{t}=\sigma_{1}\left(\vec{z}_{N \times 1}^{t}\right)=\sigma_{1}\left(\bar{U}_{N \times J} \cdot \vec{x}_{J \times 1}^{t}+\bar{W}_{N \times N} \cdot \vec{h}_{N \times 1}^{t-1}+\vec{b}_{N \times 1}\right) \tag{9.12}\\
\vec{o}_{K \times 1}^{t}=\bar{V}_{K \times N} \cdot \vec{h}_{N \times 1}^{t}+\vec{c}_{K \times 1} \tag{9.13}\\
\tilde{\vec{y}}_{K \times 1}^{t}=\sigma_{2}\left(\vec{o}_{K \times 1}^{t}\right)=\sigma_{2}\left(\bar{V}_{K \times N} \cdot \vec{h}_{N \times 1}^{t}+\vec{c}_{K \times 1}\right) \tag{9.14}
\end{gather*}
$$
where $\sigma_{1}$ and $\sigma_{2}$ are the two activation functions used for the hidden layer and the output layer, respectively.
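As a minimal sketch of Eqs. (9.12)-(9.14) for one time step, assuming NumPy column arrays with the shapes above, tanh for $\sigma_{1}$, and the sigmoid function for $\sigma_{2}$ (the function names here are illustrative, not from the text):

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def forward_step(x_t, h_prev, U, W, V, b, c):
    """One forward step of the RNN unit (Eqs. 9.12-9.14)."""
    z_t = U @ x_t + W @ h_prev + b   # (N, 1), argument of sigma_1 in Eq. 9.12
    h_t = np.tanh(z_t)               # (N, 1), hidden state, sigma_1 = tanh
    o_t = V @ h_t + c                # (K, 1), Eq. 9.13
    y_hat_t = sigmoid(o_t)           # (K, 1), prediction, Eq. 9.14, sigma_2 = sigmoid
    return h_t, y_hat_t
```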
The loss function at time step $t$, $\ell^{t}$, measures the distance between $\tilde{\vec{y}}^{t}$ and $\vec{y}^{t}$. For example, a negative log-likelihood (cross-entropy) loss is usually used for classification tasks, while the following squared error function is used for regression tasks.
$$
\begin{equation*}
\ell=\sum_{t=1}^{\tau} \ell^{t}=\sum_{t=1}^{\tau} \frac{1}{2}\left(\tilde{\vec{y}}_{K \times 1}^{t}-\vec{y}_{K \times 1}^{t}\right)^{2} \tag{9.15}
\end{equation*}
$$
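A corresponding sketch of Eq. (9.15), summing the per-step squared errors into a scalar loss; the lists `y_hats` and `ys`, holding one $(K, 1)$ array per time step, are assumed names carried through the sketches below.

```python
import numpy as np

def sequence_loss(y_hats, ys):
    """Total squared-error loss over tau steps (Eq. 9.15)."""
    return sum(0.5 * np.sum((y_hat - y) ** 2) for y_hat, y in zip(y_hats, ys))
```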

9.3.2. Backward Pass

The backward pass of an RNN, in which the network weights are updated via methods such as gradient descent, is termed Backpropagation Through Time (BPTT). The derivation of the backpropagation rules, i.e., the equations that propagate the gradients backward from the difference between the predicted and actual outputs, depends on the loss function and the activation functions.
In the following, we derive the general backpropagation equations. In this derivation, we use the squared error for the loss, tanh for $\sigma_{1}$, and the sigmoid function for $\sigma_{2}$, so that the resulting equations are detailed enough to be used directly in implementations. Accordingly, the three functions have the following derivatives, which are needed in the backpropagation.
$$
\begin{gather*}
\frac{\partial \ell^{t}}{\partial \tilde{\vec{y}}^{t}}=\frac{\partial\left[\frac{1}{2}\left(\tilde{\vec{y}}^{t}-\vec{y}^{t}\right)^{2}\right]}{\partial \tilde{\vec{y}}^{t}}=\tilde{\vec{y}}^{t}-\vec{y}^{t} \tag{9.16}\\
\sigma_{2}^{\prime}=\frac{\partial \tilde{\vec{y}}^{t}}{\partial \vec{o}^{t}}=\tilde{\vec{y}}^{t} \odot\left(1-\tilde{\vec{y}}^{t}\right) \tag{9.17}\\
\sigma_{1}^{\prime}=\frac{\partial \vec{h}^{t}}{\partial \vec{z}^{t}}=1-\left(\vec{h}^{t}\right)^{2} \tag{9.18}
\end{gather*}
$$
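These three derivatives translate into simple elementwise operations; a sketch with illustrative function names:

```python
def dloss_dyhat(y_hat_t, y_t):
    """Eq. 9.16: derivative of the squared-error loss w.r.t. the prediction."""
    return y_hat_t - y_t                  # (K, 1)

def dsigmoid_from_output(y_hat_t):
    """Eq. 9.17: sigmoid derivative expressed through its output."""
    return y_hat_t * (1.0 - y_hat_t)      # (K, 1), elementwise

def dtanh_from_output(h_t):
    """Eq. 9.18: tanh derivative expressed through its output."""
    return 1.0 - h_t ** 2                 # (N, 1), elementwise
```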
The gradients of $\bar{V}_{K \times N}$ and $\vec{c}_{K \times 1}$ can be calculated directly as follows:
$$
\begin{gather*}
\frac{\partial \ell}{\partial \vec{c}_{K \times 1}}=\sum_{t=1}^{\tau} \frac{\partial \ell^{t}}{\partial \vec{c}}=\sum_{t=1}^{\tau}\left[\left(\tilde{\vec{y}}^{t}-\vec{y}^{t}\right) \odot \tilde{\vec{y}}^{t} \odot\left(1-\tilde{\vec{y}}^{t}\right)\right] \tag{9.19}\\
\frac{\partial \ell}{\partial \bar{V}_{K \times N}}=\sum_{t=1}^{\tau} \frac{\partial \ell^{t}}{\partial \bar{V}}=\sum_{t=1}^{\tau}\left[\left(\tilde{\vec{y}}^{t}-\vec{y}^{t}\right) \odot \tilde{\vec{y}}^{t} \odot\left(1-\tilde{\vec{y}}^{t}\right)\right]_{K \times 1} \cdot\left(\vec{h}^{t}\right)_{1 \times N}^{T} \tag{9.20}
\end{gather*}
$$
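A sketch of Eqs. (9.19)-(9.20), accumulating the two gradients over all time steps; `y_hats`, `ys`, and `hs` are assumed lists of per-step column arrays.

```python
import numpy as np

def grads_c_V(y_hats, ys, hs):
    """Accumulate dl/dc (K, 1) and dl/dV (K, N) over all steps (Eqs. 9.19-9.20)."""
    dc = np.zeros_like(ys[0])                                # (K, 1)
    dV = np.zeros((ys[0].shape[0], hs[0].shape[0]))          # (K, N)
    for y_hat_t, y_t, h_t in zip(y_hats, ys, hs):
        do_t = (y_hat_t - y_t) * y_hat_t * (1.0 - y_hat_t)   # dl^t/do^t, (K, 1)
        dc += do_t                                           # Eq. 9.19
        dV += do_t @ h_t.T                                   # Eq. 9.20, (K,1)·(1,N)
    return dc, dV
```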
The calculation of the gradients for $\bar{W}$, $\bar{U}$, and $\vec{b}$ is more complicated. As illustrated in the unfolded RNN, the gradient at a time step $t$ comes from two parts: the loss caused by $\tilde{\vec{y}}^{t}-\vec{y}^{t}$ at the current step, and the gradient propagated back from the next time step via the hidden-layer connection (between $\vec{h}^{t}$ and $\vec{h}^{t+1}$). Thus, the gradient calculation needs to start from the last time step $\tau$ and then move backward in time, step by step.
$\bar{W}$, $\bar{U}$, and $\vec{b}$ are contained in the function for $\vec{h}^{t}$. Thus, the calculation of the gradients for these parameters also involves $\frac{\partial \ell}{\partial \vec{h}^{t}}$ when applying the chain rule. For example, for $\bar{W}$, we have $\frac{\partial \ell}{\partial \bar{W}}{}_{N \times N}=\frac{\partial \ell}{\partial \vec{h}}{}_{N \times 1} \cdot \frac{\partial \vec{h}}{\partial \bar{W}}{}_{1 \times N}$. To simplify the derivation, we define the following intermediate variable:
$$
\begin{equation*}
\vec{\delta}_{N \times 1}^{t}=\frac{\partial \ell}{\partial \vec{h}_{N \times 1}^{t}} \tag{9.21}
\end{equation*}
$$
Then, we can expand $\vec{\delta}^{t}$ to obtain the recursive relationship between consecutive steps.
$$
\begin{align*}
\vec{\delta}_{N \times 1}^{t} &=\frac{\partial \vec{o}^{t}}{\partial \vec{h}^{t}}{}_{N \times K} \cdot \frac{\partial \ell}{\partial \vec{o}^{t}}{}_{K \times 1}+\frac{\partial \vec{h}^{t+1}}{\partial \vec{h}^{t}}{}_{N \times N} \cdot \frac{\partial \ell}{\partial \vec{h}^{t+1}}{}_{N \times 1} \tag{9.22}\\
&=\bar{V}_{N \times K}^{T} \cdot\left[\left(\tilde{\vec{y}}^{t}-\vec{y}^{t}\right) \odot \tilde{\vec{y}}^{t} \odot\left(1-\tilde{\vec{y}}^{t}\right)\right]_{K \times 1}+\bar{W}_{N \times N}^{T} \cdot \operatorname{diag}\left[1-\left(\vec{h}^{t+1}\right)^{2}\right]_{N \times N} \cdot \vec{\delta}_{N \times 1}^{t+1}
\end{align*}
$$
where the function $\operatorname{diag}(\vec{h})$ converts a vector $\vec{h}$ with a shape of $(N,)$ or $(N, 1)$ into a 2D array with a shape of $(N, N)$ whose diagonal is $\vec{h}$.
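In NumPy, this operation corresponds to `np.diag` applied to the flattened vector; a quick illustration:

```python
import numpy as np

h = np.array([[0.1], [0.2], [0.3]])   # shape (3, 1)
D = np.diag(h.ravel())                # shape (3, 3) with h on the diagonal
```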
The recursion requires $\vec{\delta}^{t+1}$, which does not exist for the last step $t=\tau$: there is no gradient coming from a later step. Thus, $\vec{\delta}^{\tau}$ can be calculated directly.
$$
\begin{equation*}
\vec{\delta}_{N \times 1}^{\tau}=\frac{\partial \vec{o}^{\tau}}{\partial \vec{h}^{\tau}}{}_{N \times K} \cdot \frac{\partial \ell}{\partial \vec{o}^{\tau}}{}_{K \times 1}=\bar{V}_{N \times K}^{T} \cdot\left[\left(\tilde{\vec{y}}^{\tau}-\vec{y}^{\tau}\right) \odot \tilde{\vec{y}}^{\tau} \odot\left(1-\tilde{\vec{y}}^{\tau}\right)\right]_{K \times 1} \tag{9.23}
\end{equation*}
$$
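A sketch of the backward recursion for $\vec{\delta}^{t}$ (Eqs. 9.22-9.23), starting at $t=\tau$; the product with the diagonal matrix is applied elementwise, and the list names are assumptions carried over from the sketches above.

```python
def compute_deltas(y_hats, ys, hs, V, W):
    """delta^t = dl/dh^t for t = tau, ..., 1, computed backward (Eqs. 9.22-9.23)."""
    tau = len(ys)
    deltas = [None] * tau
    for t in reversed(range(tau)):
        do_t = (y_hats[t] - ys[t]) * y_hats[t] * (1.0 - y_hats[t])  # dl^t/do^t, (K, 1)
        delta_t = V.T @ do_t                                        # (N, 1)
        if t < tau - 1:
            # Gradient from step t+1: W^T · diag[1-(h^{t+1})^2] · delta^{t+1}
            delta_t += W.T @ ((1.0 - hs[t + 1] ** 2) * deltas[t + 1])
        deltas[t] = delta_t
    return deltas
```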
With $\vec{\delta}^{t}$, the gradients of $\bar{W}$, $\bar{U}$, and $\vec{b}$ can be calculated as
$$
\begin{gather*}
\frac{\partial \ell}{\partial \bar{W}_{N \times N}}=\sum_{t=1}^{\tau} \operatorname{diag}\left[1-\left(\vec{h}^{t}\right)^{2}\right]_{N \times N} \cdot \vec{\delta}_{N \times 1}^{t} \cdot\left(\vec{h}^{t-1}\right)_{1 \times N}^{T} \tag{9.24}\\
\frac{\partial \ell}{\partial \bar{U}_{N \times J}}=\sum_{t=1}^{\tau} \operatorname{diag}\left[1-\left(\vec{h}^{t}\right)^{2}\right]_{N \times N} \cdot \vec{\delta}_{N \times 1}^{t} \cdot\left(\vec{x}^{t}\right)_{1 \times J}^{T} \tag{9.25}\\
\frac{\partial \ell}{\partial \vec{b}_{N \times 1}}=\sum_{t=1}^{\tau} \operatorname{diag}\left[1-\left(\vec{h}^{t}\right)^{2}\right]_{N \times N} \cdot \vec{\delta}_{N \times 1}^{t} \tag{9.26}
\end{gather*}
$$
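A sketch of Eqs. (9.24)-(9.26); `h0` denotes the initial hidden state $\vec{h}^{0}$ (an assumed name), and the product with the diagonal matrix is again applied elementwise.

```python
import numpy as np

def grads_W_U_b(xs, hs, deltas, h0):
    """Accumulate dl/dW (N, N), dl/dU (N, J), and dl/db (N, 1) (Eqs. 9.24-9.26)."""
    N, J = hs[0].shape[0], xs[0].shape[0]
    dW, dU, db = np.zeros((N, N)), np.zeros((N, J)), np.zeros((N, 1))
    for t in range(len(xs)):
        dz_t = (1.0 - hs[t] ** 2) * deltas[t]   # diag[1-(h^t)^2] · delta^t, (N, 1)
        h_prev = hs[t - 1] if t > 0 else h0     # h^{t-1}, with h^0 the initial state
        dW += dz_t @ h_prev.T                   # Eq. 9.24, (N,1)·(1,N)
        dU += dz_t @ xs[t].T                    # Eq. 9.25, (N,1)·(1,J)
        db += dz_t                              # Eq. 9.26
    return dW, dU, db
```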
When implementing the above backward pass, we can start from the last time step and move backward. We first calculate $\frac{\partial \ell}{\partial \vec{c}}$, $\frac{\partial \ell}{\partial \bar{V}}$, and $\vec{\delta}^{t}$. Then, we compute $\frac{\partial \ell}{\partial \bar{W}}$, $\frac{\partial \ell}{\partial \bar{U}}$, and $\frac{\partial \ell}{\partial \vec{b}}$.
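Putting the pieces together, a hypothetical driver for one forward and one backward pass over a sequence, reusing the sketch functions defined above, might look like this:

```python
def bptt_gradients(xs, ys, h0, U, W, V, b, c):
    """One forward pass plus BPTT over a sequence of (J,1) inputs and (K,1) labels."""
    # Forward pass: store h^t and y_hat^t for every time step
    hs, y_hats, h_prev = [], [], h0
    for x_t in xs:
        h_t, y_hat_t = forward_step(x_t, h_prev, U, W, V, b, c)
        hs.append(h_t)
        y_hats.append(y_hat_t)
        h_prev = h_t
    # Backward pass: output-side gradients, then deltas, then the recurrent gradients
    dc, dV = grads_c_V(y_hats, ys, hs)
    deltas = compute_deltas(y_hats, ys, hs, V, W)
    dW, dU, db = grads_W_U_b(xs, hs, deltas, h0)
    return dU, dW, dV, db, dc
```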
All the model parameters, i.e., $\vec{c}$, $\bar{V}$, $\bar{W}$, $\bar{U}$, and $\vec{b}$, which can be represented collectively by a general parameter array $\theta$, can then be updated using gradient descent:
$$
\begin{equation*}
\theta=\theta-\alpha \cdot \frac{\partial \ell}{\partial \theta} \tag{9.27}
\end{equation*}
$$
where $\alpha$ is the learning rate.
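As a final sketch, Eq. (9.27) applied in place to every parameter array (again with assumed names); updating in place keeps the weights shared across the unfolded time steps.

```python
def sgd_update(params, grads, alpha=0.01):
    """Eq. 9.27: theta <- theta - alpha * dl/dtheta for each parameter array."""
    for theta, dtheta in zip(params, grads):
        theta -= alpha * dtheta   # in-place, so the shared weights stay shared

# Example usage with the arrays and gradients from the sketches above:
# dU, dW, dV, db, dc = bptt_gradients(xs, ys, h_prev, U, W, V, b, c)
# sgd_update([U, W, V, b, c], [dU, dW, dV, db, dc], alpha=0.1)
```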